A marketing campaign is a set of strategic activities that advance a business's goals. Predicting the effect of marketing campaigns in real industrial scenarios is complex and challenging, because prior knowledge is typically learned from observational data collected without any campaign intervention. Moreover, each subject is always exposed to the interference of several campaigns simultaneously, so the effect of a single campaign cannot be easily parsed and evaluated. To the best of our knowledge, there is currently no effective method for this kind of problem, i.e., modeling an individual-level prediction task on a hierarchical structure with multiple intertwined events. In this paper, we provide an in-depth analysis of the underlying parse-tree-like structure involved in the effect prediction task, and we further establish a Hierarchical Capsule Prediction Network (HAPNET) for predicting the effects of marketing campaigns. Extensive results on both synthetic and real data demonstrate the superiority of our model over state-of-the-art methods and show remarkable practicability in real industrial applications.
Using augmented reality (AR) for navigation purposes has proven beneficial for assisting physicians during surgical procedures. These applications typically require knowing the pose of the surgical tools and the patient in order to provide visual information that the surgeon can use during task execution. Existing medical-grade tracking systems use infrared cameras placed inside the operating room (OR) to identify retro-reflective markers attached to objects of interest and to compute their pose. Some commercially available AR head-mounted displays (HMDs) use similar cameras for self-localization, hand tracking, and estimating object depth. This work presents a framework that uses the built-in cameras of an AR HMD to accurately track retro-reflective markers, such as those used during surgical procedures, without the need to integrate any additional components. The framework is also capable of tracking multiple tools simultaneously. Our results show that marker detection and tracking can be achieved with an accuracy of 0.09 +- 0.06 mm for lateral translation, 0.42 +- 0.32 mm for longitudinal translation, and 0.80 +- 0.39 deg for rotations around the vertical axis. Furthermore, to showcase the relevance of the proposed framework, we evaluate the system's performance in the context of a surgical procedure. This use case was designed to replicate the scenario of k-wire insertion in an orthopedic procedure. For the evaluation, visual navigation was provided to two surgeons and one biomedical researcher, each performing 21 injections. The results of this use case show accuracy comparable to that reported for AR-based navigation procedures.
Panoptic Narrative Grounding (PNG) is a new task whose goal is to segment the visual objects of both thing and stuff categories described by a dense narrative caption of a still image. Previous two-stage approaches first extract segmentation region proposals with an off-the-shelf panoptic segmentation model, then perform coarse region-phrase matching to ground each noun phrase to its candidate regions. However, the two-stage pipeline is usually limited by the quality of the low-quality proposals from the first stage, as well as by the loss of spatial detail caused by region feature pooling and by the complex strategies designed separately for thing and stuff categories. To alleviate these drawbacks, we propose a one-stage, end-to-end Pixel-Phrase Matching Network (PPMN), which directly matches each phrase to its corresponding pixels instead of region proposals and outputs panoptic segmentation by simple combination. Our model can thus exploit sufficient and finer cross-modal semantic correspondence from the supervision of densely annotated pixel-phrase pairs rather than sparse region-phrase pairs. In addition, we propose a Language-Compatible Pixel Aggregation (LCPA) module to further enhance the discriminative ability of phrase features through multi-round refinement, which selects the most compatible pixels for each phrase to adaptively aggregate the corresponding visual context. Extensive experiments show that our method achieves new state-of-the-art performance on the PNG benchmark with a gain of 4.0 points in absolute Average Recall.
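To make the pixel-phrase matching idea concrete, here is a minimal sketch (not the authors' implementation) of how per-phrase masks could be obtained by directly correlating phrase embeddings with a pixel feature map; the layer names, tensor shapes, and projection choices are assumptions for illustration only.

import torch
import torch.nn as nn

class PixelPhraseMatcher(nn.Module):
    # Sketch of one-stage pixel-phrase matching (hypothetical layer names and shapes).
    def __init__(self, pixel_dim=256, phrase_dim=768, embed_dim=256):
        super().__init__()
        self.pixel_proj = nn.Conv2d(pixel_dim, embed_dim, kernel_size=1)
        self.phrase_proj = nn.Linear(phrase_dim, embed_dim)

    def forward(self, pixel_feats, phrase_feats):
        # pixel_feats:  (B, C, H, W) visual feature map
        # phrase_feats: (B, N, D) one embedding per noun phrase
        p = self.pixel_proj(pixel_feats)               # (B, E, H, W)
        q = self.phrase_proj(phrase_feats)             # (B, N, E)
        # Dot-product similarity between every phrase and every pixel
        logits = torch.einsum('bne,behw->bnhw', q, p)  # (B, N, H, W)
        return logits.sigmoid()                        # per-phrase soft masks

# Dense supervision would then be a per-pixel loss against annotated pixel-phrase pairs.
matcher = PixelPhraseMatcher()
masks = matcher(torch.randn(2, 256, 64, 64), torch.randn(2, 5, 768))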
Denoising diffusion probabilistic models (DDPMs) can perform conditional image generation from prior noise to real data by introducing an independent noise-aware classifier that provides conditional gradient guidance at each time step of the denoising process. However, because the classifier can easily discriminate incompletely generated images that have only high-level structure, the gradient, which serves as class-information guidance, tends to vanish early, causing the conditional generation process to collapse into the unconditional one. To address this problem, we propose two simple but effective approaches from two perspectives. For the sampling procedure, we take the entropy of the predicted distribution as a measure of how far the guidance has vanished and propose an entropy-aware scaling method to adaptively recover the conditional semantic guidance for each generated sample. For the training stage, we propose an entropy-aware optimization objective to alleviate overconfident predictions on noisy data. On ImageNet1000 256x256, with our proposed sampling scheme and trained classifier, the pretrained conditional and unconditional DDPM models can achieve 10.89% (4.59 to 4.09) and 43.5% (12 to 6.78) FID improvements, respectively.
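As a rough illustration of the entropy-aware scaling idea described above, the snippet below rescales the classifier-guidance gradient by a factor derived from the entropy of the classifier's predicted distribution; the classifier signature, the normalization, and the exact scaling rule are assumptions, not the paper's formula.

import torch
import torch.nn.functional as F

def entropy_aware_guidance(classifier, x_t, t, y, base_scale=1.0, num_classes=1000):
    # Sketch: scale classifier guidance by predictive entropy (hypothetical scaling rule).
    x_t = x_t.detach().requires_grad_(True)
    logits = classifier(x_t, t)                 # noise-aware classifier (assumed interface)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Normalized entropy in [0, 1]; high entropy signals that the guidance is vanishing
    entropy = -(probs * log_probs).sum(dim=-1) / torch.log(torch.tensor(float(num_classes)))
    selected = log_probs[range(len(y)), y].sum()
    grad = torch.autograd.grad(selected, x_t)[0]
    # Amplify guidance when the classifier is uncertain (one possible choice of schedule)
    scale = base_scale * (1.0 + entropy).view(-1, 1, 1, 1)
    return scale * grad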
Large-scale protein language models (PLMs) have improved performance on protein prediction tasks ranging from 3D structure prediction to various function predictions. In particular, AlphaFold, a groundbreaking AI system, may well reshape structural biology. However, the utility of AlphaFold's PLM module, Evoformer, has not been explored beyond structure prediction. In this paper, we investigate the representation ability of three popular PLMs: ESM-1b (single sequence), MSA Transformer (multiple sequence alignment), and Evoformer (structure), with a special focus on Evoformer. Specifically, we aim to answer the following key questions: (i) Does Evoformer, trained as part of AlphaFold, produce representations that are predictive of protein function? (ii) If so, can it replace ESM-1b and MSA Transformer? (iii) How much do these PLMs rely on evolution-related protein data, and in this regard, do they complement each other? We compare these models through empirical studies and offer new insights and conclusions. Finally, we release code and datasets for reproducibility.
Referring video object segmentation aims to predict foreground labels for objects referred to by natural language expressions in videos. Previous methods either depend on 3D ConvNets or incorporate additional 2D ConvNets as encoders to extract mixed spatio-temporal features. However, these methods suffer from spatial misalignment or false distractors due to the delayed and implicit spatio-temporal interaction occurring in the decoding phase. To address these limitations, we propose a Language-Bridged Duplex Transfer (LBDT) module, which uses language as an intermediary bridge to accomplish explicit and adaptive spatio-temporal interaction earlier, in the encoding phase. Concretely, cross-modal attention is performed among the temporal encoder, the referring words, and the spatial encoder to aggregate and transfer language-relevant motion and appearance information. We also propose a Bilateral Channel Activation (BCA) module in the decoding phase to further denoise and highlight the spatio-temporally consistent features via channel-wise activation. Extensive experiments show that our method achieves new state-of-the-art performance on four popular benchmarks, with absolute AP gains of 6.8% and 6.9% on A2D Sentences and J-HMDB Sentences respectively, while consuming around 7x less computational overhead.
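A minimal sketch of the language-bridged interaction, in which word features first gather branch-specific cues and then pass them to the other branch; the module composition below is an assumption for illustration, not the LBDT implementation.

import torch
import torch.nn as nn

class LanguageBridge(nn.Module):
    # Sketch: words gather motion/appearance cues, then transfer them across branches.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.gather_motion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gather_appearance = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, words, temporal_tokens, spatial_tokens):
        # words: (B, L, C); temporal_tokens / spatial_tokens: flattened (B, T*H*W, C) features
        motion, _ = self.gather_motion(words, temporal_tokens, temporal_tokens)
        appearance, _ = self.gather_appearance(words, spatial_tokens, spatial_tokens)
        # Language-relevant motion enriches the spatial branch, and vice versa (duplex transfer)
        spatial_out, _ = self.to_spatial(spatial_tokens, motion, motion)
        temporal_out, _ = self.to_temporal(temporal_tokens, appearance, appearance)
        return temporal_out, spatial_out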
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
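To illustrate how a style code might adjust feed-forward weights, here is a small sketch of a style-modulated feed-forward layer; the modulation scheme (per-channel scale and shift predicted from the style code) is an assumption, not the authors' exact design.

import torch
import torch.nn as nn

class StyleAdaptiveFFN(nn.Module):
    # Sketch: a feed-forward layer whose hidden channels are modulated by a style code.
    def __init__(self, dim=256, hidden=1024, style_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        # Predict per-channel scale/shift from the style code (hypothetical choice)
        self.style_to_scale = nn.Linear(style_dim, hidden)
        self.style_to_shift = nn.Linear(style_dim, hidden)

    def forward(self, x, style_code):
        # x: (B, T, dim) content tokens; style_code: (B, style_dim)
        h = self.fc1(x)
        scale = self.style_to_scale(style_code).unsqueeze(1)  # (B, 1, hidden)
        shift = self.style_to_shift(style_code).unsqueeze(1)
        h = torch.relu(h * (1.0 + scale) + shift)             # style-conditioned activation
        return self.fc2(h)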
In this paper we explore the task of modeling (semi) structured object sequences; in particular we focus our attention on the problem of developing a structure-aware input representation for such sequences. In such sequences, we assume that each structured object is represented by a set of key-value pairs which encode the attributes of the structured object. Given a universe of keys, a sequence of structured objects can then be viewed as an evolution of the values for each key, over time. We encode and construct a sequential representation using the values for a particular key (Temporal Value Modeling - TVM) and then self-attend over the set of key-conditioned value sequences to create a representation of the structured object sequence (Key Aggregation - KA). We pre-train and fine-tune the two components independently and present an innovative training schedule that interleaves the training of both modules with shared attention heads; a schematic sketch follows below. We find that this iterative two-part training results in better performance than a unified network with hierarchical encoding, as well as over other methods that use a {\em record-view} representation of the sequence \cite{de2021transformers4rec} or a simple {\em flattened} representation of the sequence. We conduct experiments using real-world data to demonstrate the advantage of interleaving TVM-KA on multiple tasks, along with detailed ablation studies motivating our modeling choices. We find that our approach performs better than flattening sequence objects and also allows us to operate on significantly larger sequences than existing methods.
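The sketch below conveys the two components at a schematic level: a temporal encoder models each key's value sequence over time (TVM), and self-attention over the resulting key-level summaries aggregates them into a sequence representation (KA). The encoder choices, shared weights across keys, and mean pooling are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn

class TVMKASketch(nn.Module):
    # Illustrative sketch of Temporal Value Modeling + Key Aggregation.
    def __init__(self, num_keys, vocab_size, dim=128, heads=4):
        super().__init__()
        self.value_embed = nn.Embedding(vocab_size, dim)
        self.tvm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.key_embed = nn.Embedding(num_keys, dim)
        self.ka = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)

    def forward(self, values):
        # values: (B, K, T) value-token ids for each of K keys across T time steps
        B, K, T = values.shape
        v = self.value_embed(values).view(B * K, T, -1)
        v = self.tvm(v).mean(dim=1).view(B, K, -1)       # one summary per key (TVM)
        v = v + self.key_embed.weight.unsqueeze(0)       # condition each summary on its key
        return self.ka(v).mean(dim=1)                    # object-sequence representation (KA)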
Unmanned aerial vehicle (UAV) swarms are considered a promising technique for next-generation communication networks due to their flexibility, mobility, low cost, and ability to collaboratively and autonomously provide services. Distributed learning (DL) enables UAV swarms to intelligently provide communication services, multi-directional remote surveillance, and target tracking. In this survey, we first introduce several popular DL algorithms such as federated learning (FL), multi-agent reinforcement learning (MARL), distributed inference, and split learning, and present a comprehensive overview of their applications for UAV swarms, such as trajectory design, power control, wireless resource allocation, user assignment, perception, and satellite communications. Then, we present several state-of-the-art applications of UAV swarms in wireless communication systems, such as reconfigurable intelligent surfaces (RIS), virtual reality (VR), and semantic communications, and discuss the problems and challenges that DL-enabled UAV swarms can solve in these applications. Finally, we describe open problems of using DL in UAV swarms and future research directions for DL-enabled UAV swarms. In summary, this survey provides a comprehensive review of various DL applications for UAV swarms across a wide range of scenarios.
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
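A rough sketch of the grounding-loss idea, aligning noun embeddings parsed from the caption with mask (query) embeddings from a Mask Transformer via a contrastive objective; the loss form, projections, and the precomputed matching are assumptions, not the CGG implementation.

import torch
import torch.nn.functional as F

def grounding_loss(mask_embeds, noun_embeds, match_idx, temperature=0.07):
    # Sketch: contrastive alignment between mask queries and caption noun embeddings.
    # mask_embeds: (Q, D) query embeddings from the segmentation decoder
    # noun_embeds: (N, D) embeddings of object nouns parsed from the caption
    # match_idx:   (N,)  index of the query matched to each noun (hypothetical matching)
    mask_embeds = F.normalize(mask_embeds, dim=-1)
    noun_embeds = F.normalize(noun_embeds, dim=-1)
    logits = noun_embeds @ mask_embeds.t() / temperature  # (N, Q) similarity
    return F.cross_entropy(logits, match_idx)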